Project: Restaurant market of Los Angeles

Alexander Feldman V.1.0

This project is carried out with the aim of opening a new robot-run cafe in Los Angeles.
Research objectives:

  • Сonduct research on the local restaurant market.
  • Determine the main characteristics of the new cafe.
  • Create an attractive offer for investors.

1. Preparing the data

In [1]:
#!pip install plotly --upgrade
#!pip install pandas_profiling
#!pip install usaddress
In [2]:
import pandas as pd
import numpy as np
import scipy.stats as st
from matplotlib import pyplot as plt
import seaborn as sns
import plotly.express as px
from plotly import graph_objects as go
import plotly.figure_factory as ff
import pandas_profiling as pp
import usaddress
In [3]:
path_local_work = 'datasets/rest_data_us.csv'
path_platform_work = '/datasets/rest_data_us.csv'
try:
    rests = pd.read_csv(path_platform_work)
except: 
    rests = pd.read_csv(path_local_work)
In [4]:
#pp.ProfileReport(rests)

The quality of data is fine. There are three missing values in chain column

In [5]:
# Rename columns
rests.columns=['id', 'name', 'address', 'chain', 'type', 'seats']
rests.head()
Out[5]:
id name address chain type seats
0 11786 HABITAT COFFEE SHOP 3708 N EAGLE ROCK BLVD False Cafe 26
1 11787 REILLY'S 100 WORLD WAY # 120 False Restaurant 9
2 11788 STREET CHURROS 6801 HOLLYWOOD BLVD # 253 False Fast Food 20
3 11789 TRINITI ECHO PARK 1814 W SUNSET BLVD False Restaurant 22
4 11790 POLLEN 2100 ECHO PARK AVE False Restaurant 20
In [6]:
# Fill missing values as False (not chain), because studying the name of objects say us that it rather
# not chain establishments
rests['chain'] = rests['chain'].fillna(False)
In [7]:
rests['address'] = rests['address'].str.upper()

2. Data analysis

2.1. Proportions of the various types of establishments

Investigate the proportions of the various types of establishments. Plot a graph.

In [8]:
group_types = rests.groupby('type', as_index=False)['id'].count()
fig = px.pie(group_types, values='id', names='type', title='Proportions of the various types of establishments',
            color_discrete_sequence=px.colors.qualitative.Set1)
fig.show()

As we can see from the graph more than 75% of establishments are restaurants. In the second place is fast food - 11%. The remains have shares of less than 5%.

2.2. Proportions of chain and nonchain establishments

Investigate the proportions of chain and nonchain establishments. Plot a graph.

In [9]:
group_chain = rests.groupby('chain', as_index=False)['id'].count()
group_chain['chain'] = group_chain['chain'].replace({False:'Nonchain', True:'Chain'})
fig = go.Figure(data=[go.Pie(labels=group_chain['chain'],
                             values=group_chain['id'],
                             textinfo='label+percent',
                             insidetextorientation='radial',
                             hole=.3
                            )])
fig.update_traces(marker=dict(colors=['rgb(255,127,0)','rgb(166,86,40)']))

fig.update_layout(title='Proportions of chain and nonchain establishments', showlegend=False)
fig.show()

Almost 40% of places are chain. Globalization is coming.

2.3. Chain establishment types

Which type of establishment is typically a chain?

In [10]:
chain_rests = rests[rests['chain'] == True]
chain_rests_types = chain_rests.groupby('type', as_index=False)['id'].count().sort_values('id', ascending=False)
In [11]:
fig = px.bar(chain_rests_types, x='type', y='id', title='Chain establishment types', color='type',
            labels={'type':'Type of establishment', 'id':'Number of establishments'}, text='id',
            color_discrete_sequence=px.colors.qualitative.Set1)
fig.update_traces(texttemplate='%{text}', textposition='outside', showlegend=False)
fig.show()

In general for chains also save the trend for proportions of various types of establishments. However, the share of bakeries is higher for chains and the share of bars is lower.

2.4. Chain characteristics

What characterizes chains: many establishments with a small number of seats or a few establishments with a lot of seats?

In [12]:
# Build the graph
x_values = pd.Series(range(0, len(chain_rests)))
fig = px.scatter(chain_rests, x=x_values, y='seats', color='type', title='Number of seats in chain establishments',
                labels={'x':'Count of establishments', 'seats':'Number of seats'},
                color_discrete_sequence=px.colors.qualitative.Set1)
fig.show()
In [13]:
# plot the histogram
hist_data = [chain_rests['seats']]
group_labels = ['Distribution of seats'] # name of the dataset
fig = ff.create_distplot(hist_data, group_labels)
fig.update_layout(title='Distribution of seats in chain establishments')
fig.show()
In [14]:
print('The average of number of seats for chains establishments is {:.1f}. The median value is {:.1f}'.format(
chain_rests['seats'].mean(), chain_rests['seats'].median())) 
The average of number of seats for chains establishments is 39.7. The median value is 25.0

As we can see on the scatter plot the range of the number of seats from 0 to 250. There is a clear division of density for several areas. The first area is less than 49 seats, the second one is from 50 to 150 seats and the third is more than 150 seats. The histogram also lets to notice the additional area is less than 29 seats.
Perhaps there are some features of the regulation of establishments with a different number of seats.
The average number of seats for chains establishments is 39.7. Since the distribution is not normal the median value has a significant difference. The median value is 25.0.

In [15]:
chain_rests['group_seats'] = np.where(chain_rests['seats']<=29,
                                      '<29',
                                     np.where((chain_rests['seats']>29) & (chain_rests['seats']<49),
                                             '30-49',
                                             np.where((chain_rests['seats']>49) & (chain_rests['seats']<149),
                                              '50-149', '>150')
                                             )
                                     )
group_seats = chain_rests.groupby('group_seats', as_index=False)['id'].count()
fig = px.pie(group_seats, values='id', names='group_seats',
             title='Proportions of the various number of seats of chain establishments',
            color_discrete_sequence=px.colors.qualitative.Set1)
fig.show()

The largest part of establishments has less than 29 seats (58,8%). On the 2d place, the area from 30 to 49 seats (22,3%). On the 3d and 4th, areas up to 149 seats and from 150 seats respectively (13,8% and 5,06%).

2.5. Average number of seats

Determine the average number of seats for each type of restaurant. On average, which type of restaurant has the greatest number of seats? Plot graphs.

In [16]:
# Plot the graph
plt.figure(figsize=(10,6))
ax = sns.barplot(x="type", y="seats", data=rests)
plt.title('Average number of seats by type of establishments', fontdict={'size':15})
plt.xlabel('Type of establishment')
plt.ylabel('Number of seats')
plt.gca().spines["top"].set_alpha(0.0)    
plt.gca().spines["bottom"].set_alpha(0.5)
plt.gca().spines["right"].set_alpha(0.0)    
plt.gca().spines["left"].set_alpha(0.5)  
plt.show()

As we can see from the bar chart, the average values for different types of places (both chain and conventional) are from 20 to 50 seats. Moreover, restaurants, bars and fast foods have an average value of 30 to 49 seats, and bakery and pizza cafes up to 29 seats.

2.6. Street names with restaurants

Put the data on street names from the address column in a separate column.

In [17]:
# make a function for separating streets from adress
def cleaning_street(raw):
# treat exceptions
    if raw.startswith('OLVERA'): 
        clean_street = 'OLVERA ST'
    elif raw.startswith('1033 1/2 LOS ANGELES'):
        clean_street = 'LOS ANGELES ST'
    elif raw.startswith('308 WESTWOOD PLZ'):
        clean_street = 'WESTWOOD PLZ'
    elif raw.startswith('5333 Z00'):
        clean_street = 'ZOO DR'
    elif raw.startswith('5333 ZOO'):
        clean_street = 'ZOO DR'    
    elif raw.startswith('724 VINE'):
        clean_street = 'VINE ST'    
# main process    
    else:
        raw_address=usaddress.parse(raw) 
        clean_street = ''
        for i in raw_address:
            if i[1] == 'StreetNamePreDirectional' or i[1] == 'StreetName' or i[1] == 'StreetNamePostType':
                clean_street = clean_street +' '+ str(i[0])
    return clean_street
rests['street'] = rests.address.apply(cleaning_street)
In [18]:
rests.sample(10)
Out[18]:
id name address chain type seats street
7731 19517 LA CIENEGA KABOB HOUSE 444 1/2 N LA CIENEGA BLVD True Restaurant 11 N LA CIENEGA BLVD
6277 18063 GORDEN BIERSCH 100 WORLD WAY False Restaurant 17 WORLD WAY
1609 13395 IL CAFFE 855 S BROADWAY False Restaurant 34 S BROADWAY
3981 15767 ESPERANZA'S BAKERY 243 W ADAMS BLVD True Bakery 5 W ADAMS BLVD
3287 15073 THE CORK 4771 W ADAMS BLVD False Restaurant 34 W ADAMS BLVD
8725 20511 GYUTARO 424 E 2ND ST False Restaurant 40 E 2ND ST
528 12314 THE HOLY GRILL 8975 W PICO BLVD False Restaurant 25 W PICO BLVD
9060 20846 ADVANCE FOOD MARKET 5469 W ADAMS BLVD True Restaurant 5 W ADAMS BLVD
4270 16056 THE REDWOOD BAR & GRILL 316 W 2ND ST False Restaurant 110 W 2ND ST
976 12762 LA DONUTS 1514 S VERMONT AVE STE A False Fast Food 5 S VERMONT AVE

2.7. Top ten streets by number of restaurants

Plot a graph of the top ten streets by number of restaurants.

In [19]:
streets_rest = rests.groupby('street', as_index=False)['id'].count()
streets_rest = streets_rest.rename(columns={'id':'count'})
top_ten_streets = streets_rest.sort_values('count', ascending=False).head(10)
In [20]:
fig = px.bar(top_ten_streets, x='count', y='street', title='Top ten streets by number of establishments',
             color='street', text='count', orientation='h',
            color_discrete_sequence=px.colors.qualitative.Set1)
fig.update_traces(texttemplate='%{text}', textposition='auto')
fig.show() 
In [21]:
print('The share of the top ten streets restaurants is {:.0%} from total number of restaurants of LA.'.format(
      top_ten_streets['count'].sum() / len(rests)))
The share of the top ten streets restaurants is 29% from total number of restaurants of LA.

The top 10 streets have a huge number of restaurants. There are almost 30% of the city's establishments.

2.8. Streets with only one restaurant

Find the number of streets that only have one restaurant.

In [22]:
streets_rest['one_rest'] = np.where(streets_rest['count']==1, 'Streets with only one place', 'Streets with many places')
print('We have {} streets with only one place from {} total number of streets'.format(
    streets_rest[streets_rest['count']==1]['street'].count(), len(streets_rest)))
We have 329 streets with only one place from 757 total number of streets
In [23]:
group_one = streets_rest.groupby('one_rest', as_index=False)['street'].count()
fig = px.pie(group_one, values = 'street', names='one_rest', color='one_rest',
             title='Proportions of the streets with only one place and many places',
             color_discrete_sequence=px.colors.qualitative.Set1)
fig.update_traces(textinfo='value + percent') 
fig.show()

As we can see from the pie chart, 43,5% of streets of LA have only one establishment.

2.9. The distribution of the number of seats

For streets with a lot of restaurants, look at the distribution of the number of seats. What trends can you see?

In [24]:
rests_top_streets = rests[rests['street'].isin(top_ten_streets['street'])]
In [25]:
# Plot the histogram of distribution of number seats for the top ten streets. 
plt.figure(figsize=(18,6))
sns.kdeplot(data=rests_top_streets, x='seats', hue='street', fill=True)
plt.title('The distribution of the number of seats for top 10 streets', fontdict={'size':15})
plt.xlabel('Number of seats')
plt.axvline(x=28, color='red', linestyle='--', alpha=.5)
plt.gca().spines["top"].set_alpha(0.0)    
plt.gca().spines["bottom"].set_alpha(0.3)
plt.gca().spines["right"].set_alpha(0.0)    
plt.gca().spines["left"].set_alpha(0.3)  
plt.show()
In [26]:
print('Tha average number of seats for the top ten streets establishments is {:.2f}. The median value is {:.2f}'.format(
rests_top_streets['seats'].mean(), rests_top_streets['seats'].median())) 
Tha average number of seats for the top ten streets establishments is 45.52. The median value is 28.00

To compare these results with other streets, let's divide the street into three groups:

  • Group A - the top ten streets (more or equel 213 establishments)
  • Group B - The streets with number of establishments from 21 to 213 establihments)
  • Group C - less or equal than 20 establishments.
In [27]:
rests['group'] = np.where(rests['street'].isin(streets_rest[streets_rest['count']>=213]['street']),
                          'A', 
                          np.where(rests['street'].isin(streets_rest[streets_rest['count']<=20]['street']),
                                   'C', 'B')
                          )
In [28]:
# make violin plot for three groups:
fig = px.violin(rests, y="seats", x="group", box=True, points="all", color='group',
               title='The distribution of the number of seats by streets group')
fig.show()

At first glance, the distributions are very similar to each other.

Make the test for calculating the statistical significance of average number of seats of establishments for difference groups of streets.
How we can see on the prievues graph, groups have a not normal distibutions. We should use Mann-Whitney U-test for testing.

Null hypothesis: The average number of seats between the groups are equal.
Alternative hypothesis: The average number of seats between the groups have a significant difference.

Since we test 3 hypotheses (A/B, A/C and B/C) apply the Bonferroni correction for alpha for multiple comparisons.

Alpha level = 0.05 / 3 = 0.01666

In [29]:
group_A = rests[rests['group']=='A']['seats']
group_B = rests[rests['group']=='B']['seats']
group_C = rests[rests['group']=='C']['seats']
In [30]:
# Make the Mann-Whitney A/B U-test
alpha = 0.01666
results = st.mannwhitneyu(group_A, group_B)
print('p-value: {0:.5f}'.format(results.pvalue))
if (results.pvalue < alpha):
    print('We reject the null hypothesis. The average number of seats of establishments of the groups are not equal')
else:
    print("We can't reject the null hypothesis that the average number of seats of establishments of the groups are equal")
print('The difference is {0:.3%}'.format(group_B.mean()/group_A.mean()-1))
p-value: 0.00026
We reject the null hypothesis. The average number of seats of establishments of the groups are not equal
The difference is -7.314%
In [31]:
# Make the Mann-Whitney A/C U-test
alpha = 0.01666
results = st.mannwhitneyu(group_A, group_C)
print('p-value: {0:.3f}'.format(results.pvalue))
if (results.pvalue < alpha):
    print('We reject the null hypothesis. The average number of seats of establishments of the groups are not equal')
else:
    print("We can't reject the null hypothesis that the average number of seats of establishments of the groups are equal")
print('The difference is {0:.3%}'.format(group_C.mean()/group_A.mean()-1))
p-value: 0.002
We reject the null hypothesis. The average number of seats of establishments of the groups are not equal
The difference is -1.760%
In [32]:
# Make the Mann-Whitney B/C U-test
alpha = 0.01666
results = st.mannwhitneyu(group_B, group_C)
print('p-value: {0:.3f}'.format(results.pvalue))
if (results.pvalue < alpha):
    print('We reject the null hypothesis. The average number of seats of establishments of the groups are not equal')
else:
    print("We can't reject the null hypothesis that the average number of seats of establishments of the groups are equal")
print('The difference is {0:.3%}'.format(group_C.mean()/group_B.mean()-1))
p-value: 0.448
We can't reject the null hypothesis that the average number of seats of establishments of the groups are equal
The difference is 5.992%
In [33]:
print('The mean value of A group = {:.2f}. The mean value of B group = {:.2f}. The mean value of C group = {:.2f}'.format(
group_A.mean(), group_B.mean(), group_C.mean()))
The mean value of A group = 45.41. The mean value of B group = 42.09. The mean value of C group = 44.61

As we can see from the test results, despite the close the calculated mean values, a statistically significant difference between the mean values is present between groups A and B and also between A and C. The mean values ​​for Group B and C can be considered equal.

3. Conclusions and recommendation

Results of research on the local restaurant market.

As a result of our research, we found out the following:

  • More than 75% of establishments are restaurants. In the second place is fast food - 11%. The remains have shares of less than 5%.
  • 38% of the market belongs to chains.
  • Almost 60% of chain establishments have up to 29 seats.
  • It looks like the number of seats in establishments affects the level of spending (< 29, 29-49, 50-150, >150).
  • On average, restaurants and bars have more seats than other establishments (up to 49).
  • Almost 30% of establishments in the city located on the top ten streets.
  • The average number of seats for the top ten streets more than other streets of LA.

Recomendations for the main characteristics of the new establishment.

It is worth starting a project by opening one RUN-ROBOT CAFE, but then developing a chain of establishments. Since we want to open an unusual place, we are more interested in the tourist areas and the city center. We need to strive to make our restaurants a tourist attraction.

For the first establishment recommend the following characteristics:

  • Type of establishment: restaurant
  • Location: streets with a number of establishments more than 20.
  • Number of seats: 29.

But for the second restaurant (after successfully working out the business model in the first restaurant), it makes sense to choose a location on the top 10 streets with up to 49 seats or up to 149 seats

Create an attractive offer for investors.

Recommend that the results of this stage of the research, as well as a description of the next stage, be included in the offer for investors.

4. Presentation for investors

In [ ]: